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Abstract 

Joint peak detection is a central problem when comparing samples in genomic data analysis, but current algorithms 
for this task are unsupervised and limited to at most 2 sample types. We propose PeakSegJoint, a new constrained 
maximum likelihood segmentation model for any number of sample types. To select the number of peaks in the 
segmentation, we propose a supervised penalty learning model. To infer the parameters of these two models, we 
propose to use a discrete optimization heuristic for the segmentation, and convex optimization for the penalty learning. 

In comparisons with state-of-the-art peak detection algorithms, PeakSegJoint achieves similar accuracy, faster speeds, 
and a more interpretable model with overlapping peaks that occur in exactly the same positions across all samples. 
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1 Introduction: Joint supervised peak detection in ChIP-seq data 


Chromatin immunoprecipitation sequencing (ChIP-seq) is an experimental technique used for genome-wide profiling 
of histone modifications and transcription factor binding sites [Bailey et al.||2013i . Each experiment yields a set of 
sequence reads which are aligned to a reference genome, and then the number of aligned reads are counted at each 
genomic position. To compare samples at a given genomic position, biologists visually examine coverage plots such 
as Figure [T] for presence or absence of common “peaks.” In machine learning terms, a single sample over B base 


positions is a vector of non-negative count data z S and a peak detector is a binary classifier i 


^4 


{o,ir 


The positive class is peaks and the negative class is background noise. Importantly, peaks and background occur in 
long contiguous segments across the genome. 

In this paper we use the supervised learning framework of Hocking et al. | 2014| , who proposed to use labels 
obtained from manual visual inspection of genome plots. For each labeled genomic region i G {1,..., n} there is 
a set of count data z^ and labels (“noPeaks,” “peaks,” etc. as in Figure [TJ. These labels define a non-convex 
annotation error function 

B[c(z,),L,] = FP[c(z,), L,] + FN[c(zO, Ti] (1) 


which counts the number of false positive (FP) and false negative (FN) regions, so it takes values in the non-negative 
integers. In this framework the goal of learning is to find a peak detection algorithm c that minimizes the number of 
incorrect labels Q on a test set of data: 


minimized E[c{zi),Li]. 

C • ^ 

idlest 


( 2 ) 
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Figure 1: Fabeled ChIP-seq coverage data for ^ = 4 samples (top 4 panels) with 3 peak models (bottom 3 panels). An 
ideal peak model minimizes the number of incorrect labels (false positives are too many peaks and false negatives are 
not enough peaks). The “good” model achieves 0 errors but the “better” model is more interpretable since overlapping 
peaks in different samples always occur in the exact same positions. 
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In practical situations we have 5" > 1 samples, and a matrix Z € of count data. For example in Figurewe 

have 5" = 4 samples. In these data we are not only interested in accurate peak detection in individual samples, but also 
in detecting the differences between samples. In particular, an ideal model of a multi-sample data set is a joint peak 
detector with identical overlapping peak positions across samples (the “better” model in Figure[^. 


1.1 Contributions and organization 

The main contribution of this paper is the |PeakSegJoint| model for supervised joint peak detection. Unlike previous 
methods (Section]^, it is the first joint peak detector that explicitly models any number of sample types (Section]^. 
A secondary contribution is the JointZoom heuristic segmentation algorithm, and a supervised learning algorithm 
for efficiently selecting the |PeakSegJoint model complexity (Section |^. We show test error and timing results in 
Section]^ provide a discussion in Section]^ and propose some future research directions in Section]^ 


2 Related work 

2.1 Single-sample ChIP-seq peak detectors 


There are many different unsupervised peak detection algorithms |Wilibanks and Facciotti| pOlO Rye et al. 2010 
Szalkowski and Schmidl 120111, and the current state-of-the-art is the supervised PeakSeg model of Hocking et al. 
112015| . 


In the multi-sample setting of this paper, each of these algorithms may be applied independently to each sample. 
The drawback of these methods is that the peaks do not occur in the same positions across each sample, so it is not 
straightforward to determine differences between samples. 


2.2 Methods for several data sets 

This paper is concerned with the particular setting of analyzing several samples (perhaps of different cell types) of the 
same experiment type. For example. Figure [T] shows 2 bcell samples, 1 monocyte sample, and 1 tcell sample (all of 
the H3K4me3 experiment type). As far as we know, there are no existing algorithms that can explicitly model these 
data. 

The most similar peak detectors in the bioinformatics literature model several samples of the same cell type. For 


example, the JAMM algorithm of Ibrahim et al. 120141 can analyze several samples, but is limited to a single cell type 
since it assumes that each sample is a replicate with the exact same peak pattern. So to analyze the data of Figure [T] 
one would have to run JAMM 3 times (once for each cell type), and the resulting peaks would not be the same across 
cell types. Another example is the PePr algorithm of [Zhang et al.|pOT4[, which can m odel either one or two cell types, 
but is unsuitable for analysis of three or more cell types. In contrast, PeakSegJointj is for data from several samples 
without limit on the number of cell types. 

Although not the subject of this paper, there are several algorithms designed for the analysis of data from several 
different ChIP-seq experiments [Ernst and Kellis 2012 Hoffman et al.j 2012| Zeng et al. 2013|. In these models, the 


input is one sample (e.g. monocyte cells) with different experiments such as H3K4me3 and H3K36me3. Also, [Cb 


oi 


et al. 1 2009 1 proposed a model for one sample with both ChIP-seq and ChIP-chip data. In contrast, jPeakSegJoint 
takes as input several samples (e.g. monocyte, tcell, bcell) for one experiment such as H3K4me3. 
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3 Models 


We begin by summarizing the single-sample PeakSeg model, and then introduce the multi-sample |PeakSegJoint 
model. 


3.1 PeakSeg: finding the most likely 0,..., p^ax peaks in a single sample 

This section describes the |PeakSeg| model of |Hocking et al. | 2015|, w hich is the current state-of-the-art peak detection 
algorithm in the McGill benchmark data set of Hocking et a i.|P014|. 

Given a single sample profile z € of aligned read counts on B bases, and a maximum number of peaks Pmax, 
the PeakSeg model for the mean vector m € with p G {0,... ,Pmax} peaks is 


iii^(z) = argmin PoissonLoss(m, z) 

meR-S 

such that Peaks(m) = p, 
VjG S}, P,(m)e{0,l}, 


(PeakSeg) 

(3) 

(4) 


where the Poisson loss function is 


B 

PoissonLoss(m, z) = rrij — Zj log nij . 
i=i 


(5) 


Note that the optimization objective of minimizing the Poisson loss is equivalent to maximizing the Poisson likelihood. 
The model complexity ([^ is the number of peaks 

Peaks(m) = (Segments(m) — l)/2, (6) 


which is a function of the number of piecewise constant segments 

B 

Segments(m) = 1 + ^ rrij-i). (7) 

Finally, the peak indicator at base j is defined as the cumulative sum of signs of changes 

3 

= '^sign{mk - ruk-i), ( 8 ) 

k^2 

with Pr(iii) = 0 by convention. The constraint (|^ means that the peak indicator Pj(m) G {0,1} can be used to 
classify each base j G {1,... 5} as either background noise Pj (m) = 0 or a peak Pj (m) = 1. In other words, iriP(z) 
is a piecewise constant vector that changes up, down, up, down (and not up, up, down). Thus the even numbered 
segments are interpreted as peaks, and the odd numbered segments are interpreted as background noise. 

Note that since |PeakSeg| is defined for a single sample, it may be independently applied to each sample in data 
sets such as Figure [1] However, any overlapping peaks in different samples will not necessarily occur in the same 
positions. In the next section we hx this problem by proposing the more interpretable multi-sample |PeakSegJoint| 
model. 
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3.2 PeakSegJoint: finding the most likely common peak in 0 ,...,^ samples 

For S sample profiles Zi,..., zs € defined on the same B bases, we stack the vectors into a matrix Z G of 

count data. Consider the following model for the mean matrix M G which allows each sample to have either 0 

or 1 peaks, with a total of p S {0,..., 5} peaks: 


s 

M^(Z) = argmin PoissonLoss(ms, z^) (PeakSegJoint) 

such that Vs G {1,, S'}, Peaks(ms) G {0,1}, (9) 

Vs G {1,..., Sj, Vj G {1,..., B}, P,(m,) G {0,1}, (10) 

s 

p = Peaks(ms) (11) 

S=1 

Vsi ^ S2 I Peaks(msJ = Peaks)!!!^^) = 1, Vj G B}, 

( 12 ) 


The hrst two constraints are similar to the |PeakSeg| model constraints. The constraint on the number of peaks per 
sample (|^ means that each sample may have either zero or one peak. The constraint ( |T0) | requires the segment mean 
mg of each sample to have alternating changes (up, down, up, down and not up, up, down). The overall constraint (111 
means that there is a total of p samples with exactly 1 peak. The last constraint (12 1 means that peaks should occur 
in the exact same positions in each sample. The |PeakSegJoinf| model for a data set with S = 3 samples is shown in 
Figure]^ 
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Figure 2: A labeled data set with S 
peaks (bottom 4 panels). 


3 samples (top 3 panels) and the PeakSegJoint models for p G {0,1,2,3} 
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3.3 Supervised penalty learning 


In the last section we considered data Z G for S samples in a single genomic region with B bases. Now assume 


Z„ G for n genomic regions, along with annotated region labels Li 


that we have data Zi, 

For each genomic region i G {1,..., n} we can compute a sequence of PeakSegJoint models M°(Zi),..., M'®(Zi), 
but how will we predict which of these 5" + 1 models will be best? This is the segmentation model selection problem, 
which we propose to solve via supervised learning of a penalty function. 

First, for a positive penalty constant A G K+, we dehne the optimal number of peaks as 


p*(A,Z)= argmin pA + PoissonLoss M^(Z), 
pe{o,...,S} 


(13) 


Also suppose that we can compute d-dimensional sample-specihc feature vectors x G and stack them to obtain 
feature matrices Xi,..., X„ G We will learn a function / : R'^^^ —>• K that predicts region-specihc penalty 

values /(Xi) = log A^ G K for each genomic region i. In particular we will learn a weight vector w G in a linear 
function /w(X) = w^Xlg, where Ig is a vector of S ones. 

For supervision we use the annotated region labels Li to compute the number of incorrect regions Q for each 

Briefly, a predicted 


model size p, and then a target interval = (y , 1 /^) of penalty values |Rigaill et al. 


2013 


penalty in the target interval /(X^) G implies that the PeakSegJoint model with p^ [exp / (X^), Z^] peaks achieves 
the minimum number of incorrect labels Li, among all S -f 1 |PeakSegJoint] models for genomic region i. 

A target interval y is used with the squared hinge loss (j){x) = {x — l)^/(x < 1) to define the surrogate loss 


y, logA = ^[logA-y]+/)[y-logA], 


(14) 


for a predicted penalty value A G K+. For a weight parameter w G the convex average surrogate loss is 

1 ” 

^(w) =/w(x,)]. (15) 

n 

1—1 

Finally, we add a 1-norm penalty to regularize and encourage a sparse weight vector, thus obtaining the following 
convex supervised penalty learning problem: 

w'’'= argmin£(w)-f 7 ||w||i. (16) 

weK'i 


To predict on test data Z with features X, we compute the predicted penalty A = exp /*(X), the predicted number 
of peaks p = p*(A, Z), and hnally the predicted mean matrix Mp(Z). Each column/sample of the mean matrix has 
either two changes (the second segment is the peak) or no changes (no peak). 


4 Algorithms 

4.1 Heuristic discrete optimization for joint segmentation 

The |PeakSegJoint| model is dehned as the solution to an optimization problem with a convex objective function 
and non-convex constraints. Real data sets Z G Z^^^ may have a very large number of data points to segment B. 
Explicitly computing the maximum likelihood and feasibility for all 0{B^) possible peak start/endpoints is guaranteed 
to hnd the global optimum, but would take too much time. Instead, we propose to hnd an approximate solution using 
a new discrete optimization algorithm called JointZoom (Algorithm 1). 
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The main idea of the JointZoom algorithm is to first zoom out (downsample the data) repeatedly by a factor of 
P, obtaining a new data matrix of size b x S, where b B (line[T]). Then we solve the PeakSegJoint| problem for p 
peaks via GridSearch (lineB, a sub-routine that checks all 0{b^) possible peak start and end positions. Then we 
zoom in by a factor of /3 (lineQ and refine the peak positions in 0(/3^) time via SearchNearPeak (line[^. After 
having zoomed back in to the BinSize=l level we return the final Peak positions, and the p Samples in which that peak 
was observed. 


Algorithm 1 JointZoom, available at https : //github. com/tdhock/PeakSegJoint 

Require: count data Z G number of peaks p G {0,..., S}, zoom factor /3 G {2,3,... }. 

1: BinSize ^ MAXBlNSlZE(i3,^). 

2 : Peak, Samples ^ GridSearch(Z,p, BinSize). 

3: while 1 < BinSize do 
4: BinSize ^ BinSize//?. 

5: Peak ^ SearchNearPeak(Z, Samples, BinSize, Peak) 

6 : end while 

7: return Peak, Samples. 


For example if we fix the zoom factor at /3 = 2, a demonstration of the algorithm on a small data set with B = 24 
points is shown in Figurej^ MaxBinSize returns 4, so GridSearch considers 15 models of 6 = 7 data points at 
bin size 4, and then SearchNearPeak considers 16 models each at bin sizes 2 and 1. In the real data set of Figure]^ 
there are i? = 85846 data points, MaxBinSize returns 16384, GridSearch considers 10 models of 6 = 6 data 
points, and then SearchNearPeak considers 16 models each at bin sizes 8192, 4096,..., 4, 2, 1. 
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Figure 3: Demonstration of the JointZoom segmentation algorithm. For a data set with B = 24 data points and 
S = 2 samples (top 2 panels), the algorithm with zoom factor of /3 = 2 proceeds as follows. First, the Poisson loss 


and feasibility (lOi is computed via GridSearch over all peak starts and ends at bin size 4. The feasible model 
with minimum Poisson loss is selected, the bin size is decreased to 2, and SearchNearPeak considers a new set 
of models around the selected peak starts and ends. We continue and return the optimal model at bin size 1 (shown in 
green). Interactive figure athttp://bit. ly/lAA6TgK 
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Overall JointZoom with zoom factor /3 searches a total of 0{f3^ logi?) models, and computing the likelihood 
and feasibility for each model is an 0{pB) operation, so the alg orithm has a tim e complexity of 0{fPpB\ogB). 
Thus for a count data matrix Z G computing the sequence of PeakSegJoint models M°(Z),..., M'^(Z) takes 

0{p'^SB log B) time. 

Finally note that when there is only S' = 1 sample, JointZoom can be used to find an approximate solution to 
the |PeakSeg| model with p = 1 peak. 


4.2 Convex optimization for supervised penalty learning 


The convex supervised learning problem can be solved using gradient-based methods such as FISTA, a Fast 


Iterative Shinkage-Thresholding Algorithm | 

Beck and Teboulle 20091. For FISTA with constant step size we also 

need a Lipschitz constant of >C(w). Following the arguments of Flamary et al. 

12012), we derived a Lipschitz constant 

of ll^illF/^' Finally^ we used the subdifferential stopping criterion of 

ligaill et al. 

2013 

|. 


5 Results 


5.1 Accuracy of PeakSegJoint on benchmark data sets 

We used McGill benchmark data sets from http: //cbio . ensmp . f r/ -thocking/chip-seq-chunk-db/ 
which included a total of 12,826 manually annotated region labels |Hocking et al. 2014) . Each data set contains 
labels grouped into windows of nearby regions (from 4 to 30 windows per data set). For each data set, we performed 
6 random splits of windows into half train, half test. Since there are multiple peaks per window, and |PeakSegJoint] 
can detect at most 1 peak per sample, we divided each window into separate [PeakSegJoint| problems of size B (a 
hyper-parameter tuned via grid search). 

We compared the test error of |PeakSegJoint with three single-sample models (Figure]^. In all 7 of the McGill 
benchmark data sets, our proposed PeakSegJoint| model achieved test error rates comparable to the previous state- 
of-the-art |PeakSe^ model. In contrast, the baseline hmcan.broad | Ashoor et al. |2013[ and macs [Zhang et al. 20081 
models from the bioinformatics literature were each only effective for a single experiment type (either H3K36me3 or 
H3K4me3, but not both). 
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Figure 4; Test error of peak detectors in the 7 McGill benchmark data sets (panels from left to right). Each dot shows 
one of 6 train/test splits, and the black vertical line marks the mean of the|PeakSegJoint|model. 
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5.2 Speed of heuristic PeakSegJoint algorithm 


The JointZoom algorithm is orders of magnitude faster than existing Poisson segmentation algorithms (Figure]^. 
For simulated single-sample, single-peak data sets of size B G {10^,..., 10®}, we compared JointZoom with 
the 0(i? log i?) unconstrained pruned dynamic programming algorithm (pDPA) of ||Cleynen et al. | 2014) (R p ackage 
SegmentorSIsBack), and the 0{B^) constrained dynamic programming algorithm (cDPA) of Hocking et al. |2015| 
(R package PeakSegDP). We set the maximum number of segments to 3 in the pDPA and cDPA, meaning one peak. 
Figure [^indicates that the JointZoom algorithm of PeakSegJoint enjoys the same 0{B \ogB) asymptotic behavior 
as the pDPA. However the JOINXZOOM algorithm is faster than both the cDPA and pDPA by at least two orders of 
magnitude for the range of data sizes that is typical for real data (10^ < B < 10®). 

These speed differences are magnified when applying these Poisson segmentation models to real ChlP-seq data 
sets. For example, the hgl9 human genome assembly has about 3 x 10® bases. For the H3K36me3_AMJmmune 
data set, a resolution of about 200 kilobases per problem was optimal, so running PeakSegJoint on the entire human 
genome means solving about 15,000 segmentation problems. PeakSegJoint takes about 0.1 seconds to solve each of 
those problems (Figure |^, meaning a total computation time of about 25 minutes. In contrast, using the pDPA or 
cDPA would be orders of magnitude slower (hours or days of computation). 



data size to segment B 

Figure 5: Timings of three Poisson segmentation algorithms on simulated data sets of varying size B. The grey shaded 
area represents the range of problem sizes selected in the McGill benchmark data sets. 
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6 Discussion 


The table below presents a summary of our proposed PeakSegJoint model alongside several other peak detection 
algorithms. Time complexity is for detecting one peak in B data points. The “?” of the unsupervised joint peak 
detectors means that their articles did not discuss time complexity [Zhang et al. 2008 Ashoor et al. 2013 [Ibrahim 


et al. 2014 Zhang et ak] 20141. 


Algorithm 

Samples 

Cell types 

Time complexity 

Learning 

macs 

1 

1 

? 

unsupervised 

hmcan.broad 

1 

1 

? 

unsupervised 

PeakSeg (cDPA) 

1 

1 

0{B^) 

supervised 

JAMM 

> 1 

1 

? 

unsupervised 

PePr 

> 2 

1-2 

7 

unsupervised 

PeakSegJoint (JointZoom) 

> 1 

> 1 

0(5 log .B) 

supervised 


6.1 Multi-sample PeakSegJoint versus single-sample PeakSeg 

The jPeakSegJointj model is explicitly designed for supervised, multi-sample peak detection problems such as the 
McGill benchmark data sets that we considered. We observed that the single-sample PeakSegj model is as accurate as 
jPeakSegJoint} in these data (Figure]^. However, any single-sample model is qualitatively inferior to a multi-sample 
model since it can not predict ov erlapping peaks at the exact same positions. Furthermore, the 0{B lo g B) JoiNT- 
ZOOM algorithm can compute the jPeakSegJoint model much faster than the 0{B^) cDPA of PeakSeg| (Figure [sjl . 


6.2 PeakSegJoint versus other joint peak detectors 

We attempted to compare with the multi-sample JAMM and PePr algorithms. The jPeakSegJointj model is more 
interpretable than these existing methods, since it is able to handle several sample types. This was advantageous in 
data sets such as Figure which contains 3 cell types; tcell, bcell, and monocyte. The PeakSegJoint model can be 
run once on all cell types, and the resulting model can be easily interpreted to find the differences, since peaks occur 
in the exact same locations across samples. In contrast, existing methods such as JAMM or PePr are less suitable to 
analyze this data set since they are limited to modeling only 1 or 2 cell types. We nevertheless downloaded and ran the 
algorithms separately on each sample type, but both JAMM and PePr produced errors and so were unable to analyze 
the McGill benchmark data sets. 


7 Conclusions 


We proposed the jPeakSegJoint| model for supervised joint peak detection. It generalizes the state-of-the-art jPeakSegj 
model to multiple samples. We proposed the JointZoom Poisson segmentation algorithm that is orders of magni¬ 
tude faster than existing algorithms. Finally, we showed that choosing the number of peaks in jPeakSegJointj using 


supervised penalty learning yields test error rates that are comparable to the previous state-of-the-art PeakSeg 


model. 


For future work, we would be interested in a theoretical analysis analogous to the work of jCleynen and Lebarbier 


1 2014), w hich could suggest the form of the optimal penalty function to choose the number of peaks in the Peak- 
jSegJointj model. 
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